Go to https://github.com/uc-cfss/dataviz for the course site. This contains the course objectives, required readings, schedules, slides, etc.
Enrollment in the course is relatively small (10ish students at last count). The nice thing about having a small class is that I can tailor it to better meet your interests. The first six weeks or so of the course are pretty much set; however, in the second half we can customize it more to fit your interests and needs. For that reason, I’d like each of you to go to this issue on the course repo and share your thoughts on what you’d like to learn more about in the second half of the term. I have a tentative schedule to which we certainly can stick, but I am open to modifications if there are topics of interest to a substantial portion of the class.
A visualization is “any kind of visual representation of information designed to enable communication, analysis, discovery, exploration, etc.”1 However, what you seek to communicate can vary widely depending on your goals, and therefore affects the type of visualization you should design.
With information visualization, the goal is to visually depict abstract data that has no inherent physical form, as opposed to scientific visualization whereby the data itself are objects (in 1D, 2D, or 3D space). This data can be numerical (continuous or discrete), categorical, temporal, geospatial, text, etc. The purpose is to convey abstract data accurately, reveal the underlying structure in the data, and (potentially) encourage exploration of the data via an interactive element. Importantly, the visualization should also be aesthetically pleasing.
Alternatively, statistical graphics seek to visualize abstract data, typically of the quantitative form. The goal is to convey data accurately and reveal the underlying structure, but statistical graphics are generally not exploratory or interactive and may not always yield an aesthetically pleasing form.
Scatterplot matrix of the Credit dataset. Source: An Introduction to Statistical Learning: With Applications in R.
Double-time bar chart of crime in the city of San Francisco, 2009-10. Source: Visualizing Time with the Double-Time Bar Chart
Information dashboards are popular in business and industry. They visualize abstract data, frequently (though not always) over time. The goal is to convey large amounts of information quickly and identify outliers and trends. The downside is that they can become extremely dense.
Dashboard for student performance. Source: 2012 Perceptual Edge Dashboard Design Competition: We Have a Winner!
Fitbit dashboard. Source: me
Infographics depict abstract data in an effort to be eye-catching, capture attention, and convey information quickly. Unfortunately, they are frequently inaccurate, do not use space efficiently, and may not encourage exploration of the data.
Extremely sexual sun stroking. Source: The top 10 worst infographics ever created
Source: WTF Visualizations
Informative art visualizes abstract data in an effort to make visualization ambient or a part of everyday life. The goal is to aesthetically please the audience, not to be informative.
At this point in time the germ theory of disease was not widely accepted by the medical community or the public.2 A mother washed her baby’s diaper in a well in 1854 in London, sparking an outbreak of cholera, an intestinal disease that causes vomiting, diarrhea, and eventually death. This disease had presented itself previously in London but its cause was still unknown. Dr. John Snow, an early epidemiologist, lived in Soho, the district of London where the disease manifested in 1854, and wanted to understand how cholera spreads through a population. Snow recorded the location of individuals who contracted cholera, including their places of residence and employment. He used this information to draw a map of the region marking each case. The cases seemed to be clustered around the well pump on Broad Street. Snow used this map to deduce that the well was the source of the outbreak. Along the way he ruled out other causes by noting that individuals who lived in the area but did not contract cholera did not drink from the well. Based on this information, the government removed the handle from the well pump so the public could not draw water from it. As a result, the cholera epidemic ended.
This illustration is identified in Edward Tufte’s The Visual Display of Quantitative Information as one of “the best statistical drawings ever created”. It also demonstrates a very important rule of warfare: never invade Russia in the winter. In 1812, Napoleon ruled most of Europe. He wanted to seize control of the British Isles, but could not overcome the UK defenses. He decided to impose an embargo to weaken the nation in preparation for invasion, but Russia refused to participate. Angered at this decision, Napoleon launched an invasion of Russia with over 400,000 troops in the summer of 1812. Russia was unable to defeat Napoleon in battle, and instead waged a war of attrition. The Russian army was in near constant retreat, burning or destroying anything of value along the way to deny France usable resources. While Napoleon’s army maintained the military advantage, the lack of food and the onset of winter decimated his forces. He left France with an army of approximately 422,000 soldiers; he returned to France with just 10,000.
Charles Minard’s map is a stunning achievement for his era. It incorporates data across six dimensions to tell the story of Napoleon’s failure. The graph depicts:
What makes this such an effective visualization?3
Data maps were one of the first data visualizations, though it took thousands of years after the first cartographic maps before data maps came together.
Split into pairs and assess this graphic.
Before determining the type of visualization to draw, one must first consider the type of data and information to visualize.4 First we identify major types of data, then identify how they can be combined to generate a dataset.
Source: Visualization Analysis and Design. Tamara Munzner, with illustrations by Eamonn Maguire. A K Peters Visualization Series, CRC Press, 2014.
There are five major types of data:
Different types of datasets will contain different types of data.
Source: Visualization Analysis and Design. Tamara Munzner, with illustrations by Eamonn Maguire. A K Peters Visualization Series, CRC Press, 2014.
Tables are the standard dataset type in social science. They resemble spreadsheets, and store data in either a flat or multidimensional table.
A flat table stores data in rows and columns.
A multidimensional table uses multiple keys to uniquely identify each item. For example, longitudinal data (repeated observations of items) may still be stored in a flat table but use two columns (attributes) to uniquely identify each item. Alternatively, data can be stored in a multidimensional array that preserves the multidimensional structure.
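As a minimal sketch of these two storage options, the Python snippet below keeps a small longitudinal dataset first as a flat table identified by two columns, then as a nested structure with one level per key. The data, the attribute names (`person`, `year`, `income`), and the helper functions are all hypothetical and exist only for illustration; Python is an assumption here, not something the course prescribes.

```python
# Hypothetical longitudinal data: one income measurement per (person, year).
flat_table = [
    {"person": "A", "year": 2015, "income": 41000},
    {"person": "A", "year": 2016, "income": 43500},
    {"person": "B", "year": 2015, "income": 38000},
    {"person": "B", "year": 2016, "income": 39200},
]


def lookup_flat(rows, person, year):
    """Find a value in the flat table by matching both key columns."""
    for row in rows:
        if row["person"] == person and row["year"] == year:
            return row["income"]
    raise KeyError((person, year))


def to_multidimensional(rows):
    """Nest one dictionary level per key so the structure mirrors both dimensions."""
    table = {}
    for row in rows:
        table.setdefault(row["person"], {})[row["year"]] = row["income"]
    return table


multi_table = to_multidimensional(flat_table)

# The same item is reachable either way.
flat_value = lookup_flat(flat_table, "B", 2016)
multi_value = multi_table["B"][2016]
```

Both lookups return the same value. The flat form keeps every row self-describing, while the nested form makes the person-by-year structure explicit.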
Networks are used to specify relationships between two or more items.
A small example network with eight vertices and ten edges. Source: Wikipedia
Organization, mission, and functions manual: Civil Rights Division. Source: U.S. Department of Justice
A tree is a network with a hierarchical structure - each child node has only one parent node pointing to it.
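One rough way to make the distinction between a general network and a tree concrete is to store the network as a list of (parent, child) links and check the one-parent rule directly. The sketch below is in Python and assumes directed links and a single root; the organization names and the `is_tree` helper are hypothetical.

```python
def is_tree(nodes, edges):
    """Return True if the directed (parent, child) edges form a single-rooted tree."""
    parent = {}
    for p, c in edges:
        if c in parent:
            return False  # this child already has a parent pointing to it
        parent[c] = p

    # Exactly one node (the root) may have no parent.
    roots = [n for n in nodes if n not in parent]
    if len(roots) != 1:
        return False

    # Walking upward from any node must reach the root without looping.
    for start in nodes:
        seen = set()
        node = start
        while node in parent:
            if node in seen:
                return False  # cycle detected
            seen.add(node)
            node = parent[node]
    return True


# Hypothetical organization chart.
nodes = ["Division", "Section A", "Section B", "Unit A1"]
hierarchy = [
    ("Division", "Section A"),
    ("Division", "Section B"),
    ("Section A", "Unit A1"),
]

# Adding a second parent for "Unit A1" turns the tree into a general network.
shared_unit = hierarchy + [("Section B", "Unit A1")]

hierarchy_is_tree = is_tree(nodes, hierarchy)
shared_is_tree = is_tree(nodes, shared_unit)
```

The organization chart passes the check, while the version in which one unit reports to two sections does not.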
Fields contain attribute values associated with cells. Cells contain measurements or calculations from a continuous domain: theoretically there are an infinite number of values you could measure, so you select a discrete interval from which to sample.
Source: NASA Earth Observatory
For instance, measuring climate change is serious stuff. In order to accurately measure climate change, where do you place your measurement stations?
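The sampling decision can be sketched in a few lines of Python. The `temperature` function below is an invented stand-in for a continuous phenomenon (it is not real climate data), and `sample_field` is a hypothetical helper that measures it at evenly spaced positions; the point is only that the choice of interval determines how many cells the field dataset contains.

```python
import math


def temperature(x_km):
    """Hypothetical continuous field: a value exists at every position x_km."""
    return 15.0 + 10.0 * math.sin(x_km / 10.0)


def sample_field(field, start, stop, step):
    """Measure the field at evenly spaced positions from start to stop, inclusive."""
    n_cells = int((stop - start) / step) + 1
    return [(start + i * step, field(start + i * step)) for i in range(n_cells)]


# The same underlying field, sampled at two different intervals.
coarse = sample_field(temperature, 0.0, 100.0, 25.0)
fine = sample_field(temperature, 0.0, 100.0, 5.0)

n_coarse = len(coarse)
n_fine = len(fine)
```

A coarser interval yields fewer cells and can miss variation between stations, which is why the placement question matters.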
Geometry datasets specify information about the shape of items with explicit spatial positions. These could be maps, but also include any item like points, one-dimensional lines and curves, two-dimensional surfaces or regions, or three-dimensional volumes. Aside from maps, these types of datasets frequently appear in the physical sciences, but less so in the social sciences.
Source: Visualization Analysis and Design. Tamara Munzner, with illustrations by Eamonn Maguire. A K Peters Visualization Series, CRC Press, 2014.
Attribute types (or variable types) define the different types of data encoded in attributes, and are generally important in determining how to visually depict these attributes.
Semantics define the real-world meaning of data. Data type defines its structural or mathematical interpretation. For instance, numbers are stored in R as integers or doubles. That is the data’s type. However, these numbers can have any number of semantic meanings. Are they days of the month? A person’s age? A zip code?
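To illustrate the gap between type and semantics, here is a small sketch in Python rather than R (an assumption made for illustration; the same idea applies to R's integers and doubles). The values, the semantic labels `"quantity"` and `"label"`, and the `summarize` helper are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Both attributes have the same data type: plain integers.
ages = [22, 31, 46]
zip_codes = [60637, 60615, 60637]
same_type = all(isinstance(v, int) for v in ages + zip_codes)


def summarize(values, semantics):
    """Pick a summary that matches what the numbers mean, not how they are stored."""
    if semantics == "quantity":
        return mean(values)  # averaging ages is meaningful
    if semantics == "label":
        return Counter(values).most_common(1)[0][0]  # report the most frequent label
    raise ValueError(f"unknown semantics: {semantics}")


typical_age = summarize(ages, "quantity")
typical_zip = summarize(zip_codes, "label")
```

Nothing in the integer type prevents averaging zip codes; only knowledge of the semantics tells you that the result would be meaningless.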
A key attribute acts as an index that is used to look up the value attributes, so the key must uniquely identify each item. Sometimes a single attribute acts as the key, whereas in higher-dimensional data multiple attributes in combination form the key attributes. In the most basic table, the row number acts as the key attribute.
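A key's defining property, uniqueness, is easy to check mechanically. The Python sketch below uses invented rows and two hypothetical helpers, `is_key` and `build_index`, to test whether one attribute or a combination of attributes identifies every item, and then to build a lookup from key to value attributes.

```python
def is_key(rows, attributes):
    """Return True if the given attributes, taken together, are unique for every row."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attributes)
        if key in seen:
            return False
        seen.add(key)
    return True


def build_index(rows, attributes):
    """Map each key to its row so value attributes can be looked up by key."""
    if not is_key(rows, attributes):
        raise ValueError(f"{attributes} do not uniquely identify each item")
    return {tuple(row[a] for a in attributes): row for row in rows}


# Hypothetical repeated observations of two countries.
rows = [
    {"country": "X", "year": 2000, "gdp": 1.2},
    {"country": "X", "year": 2001, "gdp": 1.3},
    {"country": "Y", "year": 2000, "gdp": 0.8},
]

country_alone = is_key(rows, ["country"])
country_and_year = is_key(rows, ["country", "year"])
index = build_index(rows, ["country", "year"])
gdp_x_2001 = index[("X", 2001)]["gdp"]

# In the most basic table, the row number serves as the key.
by_row_number = dict(enumerate(rows))
```

Here `country` alone repeats and so cannot serve as the key, while `country` and `year` in combination can.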
Munzner defines key attributes as independent variables, while value attributes are dependent variables. I don’t particularly like this definition because depending on the research question, an attribute may serve as a dependent variable or as an independent variable (in statistical terms).
1. TA Ch 1
2. Drawn from John Snow and the Broad Street Pump
3. Source: Dataviz History: Charles Minard’s Flow Map of Napoleon’s Russian Campaign of 1812
4. Munzner Ch 2